GEOMETRY OF THE FAITHFULNESS ASSUMPTION IN CAUSAL INFERENCE

By Caroline Uhler, Garvesh Raskutti, Peter Bühlmann and Bin Yu

IST Austria, SAMSI, ETH Zürich and University of California, Berkeley

Many algorithms for inferring causality rely heavily on the faithfulness assumption. The main justification for imposing this assumption is that the set of unfaithful distributions has Lebesgue measure zero, since it can be seen as a collection of hypersurfaces in a hypercube. However, due to sampling error, the faithfulness condition alone is not sufficient for statistical estimation, and strong-faithfulness has been proposed and assumed to achieve uniform or high-dimensional consistency. In contrast to the plain faithfulness assumption, the set of distributions that is not strong-faithful has nonzero Lebesgue measure and, in fact, can be surprisingly large, as we show in this paper. We study the strong-faithfulness condition from a geometric and combinatorial point of view and give upper and lower bounds on the Lebesgue measure of strong-faithful distributions for various classes of directed acyclic graphs. Our results imply fundamental limitations for the PC-algorithm and potentially also for other algorithms based on partial correlation testing in the Gaussian case.


1. Introduction. Determining causal structure among variables based on observational data is of great interest in many areas of science. While quantifying associations among variables is well-developed, inferring causal relations is a much more challenging task. A popular approach to make the causal inference problem more tractable is given by directed acyclic graph (DAG) models, which describe conditional dependence information and causal structure.

A DAG $G = (V, E)$ consists of a set of vertices $V$ and a set of directed edges $E$ such that there is no directed cycle. We index $V = \{1, 2, \ldots, p\}$ and consider random variables $\{X_i \mid i = 1, \ldots, p\}$ associated to the nodes $V$. We denote a directed edge from vertex $i$ to vertex $j$ by $(i, j)$ or $i \to j$. In this case, $i$ is called a parent of $j$ and $j$ is called a child of $i$. If there is a directed path $i \to \cdots \to j$, then $j$ is called a descendant of $i$ and $i$ an ancestor of $j$. The skeleton of a DAG $G$ is the undirected graph obtained from $G$ by substituting directed edges by undirected edges. Two nodes which are connected by an edge in the skeleton of $G$ are called adjacent, and a triple of nodes $(i, j, k)$ is an unshielded triple if $i$ and $j$ are adjacent to $k$ but $i$ and $j$ are not adjacent. An unshielded triple $(i, j, k)$ is called a v-structure if $i \to k$ and $j \to k$. In this case, $k$ is called a collider.

The problem of estimating a DAG from the observational distribution is ill-posed due to nonidentifiability: in general, several DAGs encode the same conditional independence (CI) relations and therefore, the true underlying DAG cannot be identified from the observational distribution. However, assuming faithfulness (see Definition 1.1), the Markov equivalence class, that is, the skeleton and the set of v-structures of a DAG, is identifiable (cf. [9], Theorem 5.2.6), making it possible to infer some bounds on causal effects [8]. We focus here on the problem of estimating the Markov equivalence class of a DAG and argue that, even in the Gaussian case, severe complications arise for data of finite (or asymptotically increasing) sample size.

There has been a substantial amount of work on estimating the Markov equivalence class in the Gaussian case [3, 5, 11, 12]. Algorithms which are based on testing CI relations usually require the faithfulness assumption (cf. [12]):

Definition 1.1. A distribution $P$ is faithful to a DAG $G$ if no CI relations other than the ones entailed by the Markov property are present.

This means that if a distribution $P$ is faithful to a DAG $G$, all conditional (in-)dependences can be read off from the DAG $G$ using the so-called d-separation rule (cf. [12]). Two nodes $i, j$ are d-separated given $S$ if every path between $i$ and $j$ contains a noncollider that is in $S$ or a collider that is neither in $S$ nor an ancestor of a node in $S$. For Gaussian models, the faithfulness assumption can be expressed in terms of the d-separation rule and conditional correlations as follows.
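The d-separation rule can be checked mechanically. The following is a minimal sketch that uses the moralization criterion (restrict to the ancestral set of $\{i, j\} \cup S$, moralize, delete $S$, test connectivity), a standard criterion that is equivalent to the path-based rule stated above. The function names and the edge-list representation are illustrative assumptions, not notation from this paper.

```python
from itertools import combinations

def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for u in parents.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def d_separated(i, j, S, edges):
    """True if i and j are d-separated given S in the DAG with directed edges (u, v)."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    keep = ancestors_of({i, j} | set(S), parents)
    # Moralize the ancestral subgraph: link each node to its parents, marry co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = parents.get(v, set())  # parents of a kept node are ancestors, hence kept
        for u in pa:
            nbrs[u].add(v)
            nbrs[v].add(u)
        for u, w in combinations(pa, 2):
            nbrs[u].add(w)
            nbrs[w].add(u)
    # Search for an undirected path from i to j that avoids S.
    blocked = set(S)
    seen, stack = {i}, [i]
    while stack:
        v = stack.pop()
        if v == j:
            return False
        for w in nbrs[v]:
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return True

# Toy graphs: a v-structure 1 -> 3 <- 2 with a descendant 3 -> 4, and a chain 1 -> 2 -> 3.
collider_edges = [(1, 3), (2, 3), (3, 4)]
chain_edges = [(1, 2), (2, 3)]
```

In the v-structure, nodes 1 and 2 are d-separated by the empty set but become connected once the collider 3 or its descendant 4 is conditioned on; in the chain, conditioning on the middle node separates the endpoints.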

Definition 1.2. A multivariate Gaussian distribution $P$ is said to be faithful to a DAG $G = (V, E)$ if for any $i, j \in V$ and any $S \subset V \setminus \{i, j\}$:

$$j \text{ is d-separated from } i \mid S \iff \operatorname{corr}(X_i, X_j \mid X_S) = 0.$$

The main justification for imposing the faithfulness assumption is that the set of unfaithful distributions to a graph $G$ has measure zero. However, for data of finite sample size, estimation error issues come into play. Robins et al. [11] showed that many causal discovery algorithms, and the PC-algorithm [12] in particular, are pointwise but not uniformly consistent under the faithfulness assumption. This is because it is possible to create a sequence of distributions that is faithful but arbitrarily close to an unfaithful distribution. As a result, Zhang and Spirtes [16] defined the strong-faithfulness assumption for the Gaussian case, which requires sufficiently large nonzero partial correlations.

Definition 1.3. Given $\lambda \in (0, 1)$, a multivariate Gaussian distribution $P$ is said to be $\lambda$-strong-faithful to a DAG $G = (V, E)$ if for any $i, j \in V$ and any $S \subset V \setminus \{i, j\}$:

$$j \text{ is d-separated from } i \mid S \iff |\operatorname{corr}(X_i, X_j \mid X_S)| \le \lambda.$$

The assumption of $\lambda$-strong-faithfulness is equivalent to requiring

$$\min\{|\operatorname{corr}(X_i, X_j \mid X_S)| : j \text{ not d-separated from } i \mid S,\ \forall i, j, S\} > \lambda.$$
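As a rough illustration of the minimum condition, the sketch below computes partial correlations from a covariance matrix and checks the condition for a toy Gaussian chain. The covariance matrix, the hand-listed set of non-d-separated triples and the function names are assumptions made for this example only; indices are 0-based.

```python
import numpy as np

def partial_corr(Sigma, i, j, S=()):
    """corr(X_i, X_j | X_S), read off the precision matrix of the marginal on {i, j} and S."""
    idx = [i, j] + list(S)
    K = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

# Hypothetical covariance matrix of a Gaussian chain X_0 -> X_1 -> X_2.
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])

# Triples (i, j, S) that are NOT d-separated in the chain, listed by hand for this toy graph.
dependent = [(0, 1, ()), (0, 1, (2,)), (1, 2, ()), (1, 2, (0,)), (0, 2, ())]

def is_strong_faithful(Sigma, dependent, lam):
    """Check that min |corr(X_i, X_j | X_S)| over the non-d-separated triples exceeds lam."""
    return min(abs(partial_corr(Sigma, i, j, S)) for i, j, S in dependent) > lam
```

For this chain the only d-separated triple is $(0, 2 \mid \{1\})$, whose partial correlation vanishes, so whether the distribution is $\lambda$-strong-faithful depends only on how $\lambda$ compares with the smallest partial correlation among the listed triples.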

This  motivates  our  next  definition  which  is  weaker  than  strong-faithfulness. 

Definition 1.4. Given $\lambda \in (0, 1)$, a multivariate Gaussian distribution $P$ is said to be restricted $\lambda$-strong-faithful to a DAG $G = (V, E)$ if both of the following hold:

(i) $\min\{|\operatorname{corr}(X_i, X_j \mid X_S)| : (i, j) \in E,\ S \subset V \setminus \{i, j\} \text{ such that } |S| \le \deg(G)\} > \lambda$, where here and in the sequel, $\deg(G)$ denotes the maximal degree (i.e., sum of indegree and outdegree) of nodes in $G$;

(ii) $\min\{|\operatorname{corr}(X_i, X_j \mid X_S)| : (i, j, S) \in N_G\} > \lambda$, where $N_G$ is the set of triples $(i, j, S)$ such that $i, j$ are not adjacent but there exists $k \in V$ making $(i, j, k)$ an unshielded triple, and $i, j$ are not d-separated given $S$.

The first condition (i) is called adjacency-faithfulness in [17], the second condition (ii) is called orientation-faithfulness. If a multivariate Gaussian distribution $P$ satisfies adjacency-faithfulness with respect to a DAG $G$, we call the distribution $\lambda$-adjacency-faithful to $G$. Obviously, restricted $\lambda$-strong-faithfulness is a weaker assumption than $\lambda$-strong-faithfulness.

We now briefly discuss the relevance of these conditions and their use in previous work. Zhang and Spirtes [16] proved uniform consistency of the PC-algorithm under the strong-faithfulness assumption with $\lambda \asymp 1/\sqrt{n}$ for the low-dimensional case where the number of nodes $p = |V|$ is fixed and sample size $n \to \infty$. In a high-dimensional and sparse setting, Kalisch and Bühlmann [5] require strong-faithfulness with $\lambda_n \asymp \sqrt{\deg(G)\log(p)/n}$ (the assumption in [5] is slightly stronger, but can be relaxed as indicated here). Importantly, since $\operatorname{corr}(X_i, X_j \mid X_S)$ is required to be bounded away from 0 by $\lambda$ for vertices that are not d-separated, the set of distributions that is not $\lambda$-strong-faithful no longer has measure 0.

It is easy to see, for example, from the proof in [5] that restricted $\lambda$-strong-faithfulness is sufficient, and essentially necessary, for consistency of the PC-algorithm in the high-dimensional scenario [with $\lambda \asymp \sqrt{\deg(G)\log(p)/n}$]. Furthermore, part (i) of the restricted strong-faithfulness condition is sufficient and essentially necessary for correctness of the conservative PC-algorithm [17], where correctness refers to the property that an oriented edge is correctly oriented but there might be some nonoriented edges which could be oriented (i.e., the conservative PC-algorithm may not be fully informative). The word "essentially" above means that we may consider too many possible separation sets $S$ where $|S| \le \deg(G)$, while the necessary collection of separating sets $S$ which the (conservative) PC-algorithm has to consider might be a little bit smaller. Nevertheless, these differences are minor and we should think of part (i) of the restricted strong-faithfulness assumption as a necessary condition for consistency of the conservative PC-algorithm and both parts (i) and (ii) as a necessary condition for consistency of the PC-algorithm.

There are no known upper and lower bounds for the Lebesgue measure of $\lambda$-strong-unfaithful distributions or of restricted $\lambda$-strong-unfaithful distributions. Since these assumptions are so crucial to inferring structure in causal networks, it is vital to understand if restricted and plain $\lambda$-strong-faithfulness are likely to be satisfied.

In this paper, we address the question of how restrictive the (restricted) strong-faithfulness assumption is using geometric and combinatorial arguments. In particular, we develop upper and lower bounds on the Lebesgue measure of Gaussian distributions that are not $\lambda$-strong-faithful for various graph structures. By noting that each CI relation can be written as a polynomial equation and the unfaithful distributions correspond to a collection of real algebraic hypersurfaces, we exploit results from real algebraic geometry to bound the measure of the set of strong-unfaithful distributions. As we demonstrate in this paper, the strong-faithfulness assumption is restrictive for various reasons. First, the number of hypersurfaces corresponding to unfaithful distributions may be quite large depending on the graph structure, and each hypersurface fills up space in the hypercube. Secondly, the hypersurfaces may be defined by polynomials of high degrees depending on the graph structure. The higher the degree, the greater the curvature and therefore the surface area of the corresponding hypersurface. Finally, to get the set of $\lambda$-strong-unfaithful distributions, these hypersurfaces get fattened up by a factor which depends on the size of $\lambda$.

Our results show that the set of distributions that do not satisfy strong-faithfulness can be surprisingly large even for small and sparse graphs [e.g., 10 nodes and an expected neighborhood (adjacency) size of 2] and small values of $\lambda$ such as $\lambda = 0.01$. This implies fundamental limitations for the PC-algorithm [12] and possibly also for other algorithms based on partial correlations. Other inference methods, which are not based on conditional independence testing (or partial correlation testing), have been described. The penalized maximum likelihood estimator [3] is an example of such a method and consistency results without requiring strong-faithfulness have been given for the high-dimensional and sparse setting [15]. This method requires, however, a different and so-called permutation beta-min condition, and it is nontrivial to understand how the strong-faithfulness condition and this new condition interact or relate to each other.

The remainder of this paper is organized as follows: Section 2 presents a simple example of a 3-node fully connected DAG, where we explicitly list the polynomial equations defining the hypersurfaces and plot the parameters corresponding to unfaithful distributions. In Section 3, we define the general model for a DAG on $p$ nodes and give a precise description of the problem of bounding the measure of distributions that do not satisfy strong-faithfulness for general DAGs. In Section 4, we provide an algebraic description of the unfaithful distributions as a collection of hypersurfaces and give a combinatorial description of the defining polynomials in terms of paths along the graph. Section 5 provides a general upper bound on the measure of $\lambda$-strong-unfaithful distributions and lower bounds for various classes of DAGs, namely DAGs whose skeletons are trees, cycles or bipartite graphs $K_{2,p-2}$. Finally, in Section 6, we provide simulation results to validate our theoretical bounds.

2. Example: 3-node fully-connected DAG. In this section, we motivate the analysis in this paper using a simple example involving a 3-node fully-connected DAG. The graph is shown in Figure 1. We demonstrate that even in the 3-node case, the strong-faithfulness condition may be quite restrictive. We consider a Gaussian distribution which satisfies the directed Markov property with respect to the 3-node fully-connected DAG. An equivalent model formulation in terms of a Gaussian structural equation model is given as follows:

\begin{align*}
X_1 &= \varepsilon_1,\\
X_2 &= a_{12} X_1 + \varepsilon_2,\\
X_3 &= a_{13} X_1 + a_{23} X_2 + \varepsilon_3,
\end{align*}

where $(\varepsilon_1, \varepsilon_2, \varepsilon_3) \sim \mathcal{N}(0, I)$.² The parameters $a_{12}$, $a_{13}$ and $a_{23}$ reflect the causal structure of the graph. Whether the parameters are zero or nonzero determines the absence or presence of a directed edge.


Fig. 1. Motivating example: 3-node graph.


²The assumption of $\operatorname{var}(\varepsilon_j) = 1$ is obviously restricting the class of Gaussian DAG models. We refer to the more general discussion on this issue in Section 7.




It  is  well  known  that  through  observing  only  covariance  information  it  is  not 
always  possible  to  infer  causal  structure.  In  this  example,  the  pairwise  marginal 
and  the  conditional  covariances  are  as  follows: 


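\begin{align}
\operatorname{cov}(X_1, X_2) &= a_{12}, \tag{1}\\
\operatorname{cov}(X_1, X_3) &= a_{13} + a_{12} a_{23}, \tag{2}\\
\operatorname{cov}(X_2, X_3) &= a_{12}^2 a_{23} + a_{12} a_{13} + a_{23}, \tag{3}\\
\operatorname{cov}(X_1, X_2 \mid X_3) &= a_{13} a_{23} - a_{12}, \tag{4}\\
\operatorname{cov}(X_1, X_3 \mid X_2) &= -a_{13}, \tag{5}\\
\operatorname{cov}(X_2, X_3 \mid X_1) &= -a_{23}. \tag{6}
\end{align}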
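Expressions (1)-(6) can be checked numerically. The sketch below builds the covariance matrix of the 3-node model from arbitrary illustrative edge weights and compares it with (1)-(3). For (4)-(6) it compares against the off-diagonal entries of the concentration matrix (the inverse covariance matrix), with which these expressions coincide; each conditional covariance given the remaining variable is a nonzero multiple of the corresponding entry, so both vanish together. The variable names are assumptions of this example.

```python
import numpy as np

a12, a13, a23 = 0.3, -0.7, 0.5           # arbitrary illustrative edge weights

A = np.zeros((3, 3))
A[0, 1], A[0, 2], A[1, 2] = a12, a13, a23
I = np.eye(3)
K = (I - A) @ (I - A).T                   # concentration matrix of (X1, X2, X3)
Sigma = np.linalg.inv(K)                  # covariance matrix of (X1, X2, X3)

# Each entry pairs a numerically computed value with the closed-form polynomial.
marginal = {
    "(1)": (Sigma[0, 1], a12),
    "(2)": (Sigma[0, 2], a13 + a12 * a23),
    "(3)": (Sigma[1, 2], a12**2 * a23 + a12 * a13 + a23),
}
conditional = {
    "(4)": (K[0, 1], a13 * a23 - a12),
    "(5)": (K[0, 2], -a13),
    "(6)": (K[1, 2], -a23),
}
```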


If it were known a priori that the temporal ordering of the DAG is $(X_1, X_2, X_3)$, the problem of inferring the DAG-structure would reduce to a simple estimation problem. We would only need information about the (non-)zeroes of $\operatorname{cov}(X_1, X_2)$, $\operatorname{cov}(X_1, X_3 \mid X_2)$ and $\operatorname{cov}(X_2, X_3 \mid X_1)$, that is, information whether the single edge weights $a_{12}$, $a_{13}$ and $a_{23}$ are zero or not, which is a standard hypothesis testing problem. In particular, issues around (strong-)faithfulness would not arise. However, since the causal ordering of the DAG is unknown, algorithms based on conditional independence testing, which amount to testing partial correlations or conditional covariances, require that we check all partial correlations between two nodes given any subset of remaining nodes; a prominent example is the PC-algorithm [12]. For instance, for the 3-node case, the PC-algorithm would infer that there is an edge between nodes 1 and 2 if and only if $\operatorname{cov}(X_1, X_2) \neq 0$ and $\operatorname{cov}(X_1, X_2 \mid X_3) \neq 0$. The issue of faithfulness comes into play because it is possible that all causal parameters $a_{12}$, $a_{13}$ and $a_{23}$ are nonzero while $\operatorname{cov}(X_1, X_2 \mid X_3) = 0$, simply by setting $a_{12} = a_{13} a_{23}$ in (4).
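A concrete unfaithful parameter choice makes this tangible. In the sketch below (the weights are illustrative assumptions) all three edges are present, the marginal correlation between $X_1$ and $X_2$ is clearly nonzero, yet the partial correlation given $X_3$ vanishes, so a test of that partial correlation would wrongly suggest removing the edge between nodes 1 and 2.

```python
import numpy as np

a13, a23 = 0.8, 0.5
a12 = a13 * a23                           # all three edge weights are nonzero

A = np.zeros((3, 3))
A[0, 1], A[0, 2], A[1, 2] = a12, a13, a23
K = (np.eye(3) - A) @ (np.eye(3) - A).T   # concentration matrix
Sigma = np.linalg.inv(K)                  # covariance matrix

# Marginal correlation of X1 and X2.
corr_12 = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
# Partial correlation of X1 and X2 given the only remaining variable X3.
partial_corr_12_given_3 = -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
```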

Since in this example no CI relations are imposed by the Markov property, a distribution $P$ is unfaithful to $G$ if any of the polynomials in (1)-(6) [corresponding to (conditional) covariances] are zero. Therefore, the set of unfaithful distributions for the 3-node example is the union of 6 real algebraic varieties, namely the three coordinate hyperplanes given by (1), (5) and (6), two real algebraic hypersurfaces of degree 2 given by (2) and (4), and one real algebraic hypersurface of degree 3 given by (3).

Assuming that the causal parameters lie in the cube $(a_{12}, a_{13}, a_{23}) \in [-1, 1]^3$, we use surfex, a software for visualizing algebraic surfaces, to generate a plot of the set of parameters leading to unfaithful distributions. Figure 2(a)-(c) shows the nontrivial hypersurfaces corresponding to $\operatorname{cov}(X_1, X_3) = 0$, $\operatorname{cov}(X_1, X_2 \mid X_3) = 0$ and $\operatorname{cov}(X_2, X_3) = 0$. Figure 2(d) shows a plot of the union of all six hypersurfaces.

Fig. 2. Parameter values corresponding to unfaithful distributions in the 3-node case. Panels: (a) $\operatorname{cov}(X_1, X_3) = 0$; (b) $\operatorname{cov}(X_1, X_2 \mid X_3) = 0$; (c) $\operatorname{cov}(X_2, X_3) = 0$; (d) all 6 surfaces.

It is clear that the set of unfaithful distributions has measure zero. However, due to the curvature of the varieties and the fact that we are taking a union of 6 varieties, the chance of being "close" to an unfaithful distribution is quite large. As discussed earlier, being close to an unfaithful distribution is of great concern due to sampling error. Hence, the set of distributions that does not satisfy $\lambda$-strong-faithfulness is of interest. As a direct consequence of Definition 1.3, this set of distributions corresponds to the set of parameters satisfying at least one of the following inequalities:


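\begin{align*}
|\operatorname{cov}(X_1, X_2)| &\le \lambda \sqrt{\operatorname{var}(X_1)\operatorname{var}(X_2)},\\
|\operatorname{cov}(X_1, X_3)| &\le \lambda \sqrt{\operatorname{var}(X_1)\operatorname{var}(X_3)},\\
|\operatorname{cov}(X_2, X_3)| &\le \lambda \sqrt{\operatorname{var}(X_2)\operatorname{var}(X_3)},\\
|\operatorname{cov}(X_1, X_2 \mid X_3)| &\le \lambda \sqrt{\operatorname{var}(X_1 \mid X_3)\operatorname{var}(X_2 \mid X_3)},\\
|\operatorname{cov}(X_1, X_3 \mid X_2)| &\le \lambda \sqrt{\operatorname{var}(X_1 \mid X_2)\operatorname{var}(X_3 \mid X_2)},\\
|\operatorname{cov}(X_2, X_3 \mid X_1)| &\le \lambda \sqrt{\operatorname{var}(X_2 \mid X_1)\operatorname{var}(X_3 \mid X_1)}.
\end{align*}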


The set of parameters $(a_{12}, a_{13}, a_{23})$ satisfying any of the above relations for $\lambda \in (0, 1)$ has nontrivial volume. As we show in this paper, the volume of the distributions that are not $\lambda$-strong-faithful grows as the number of nodes and the graph density grow, since both the number of varieties and the curvature of the varieties increase.
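The size of this set can be estimated by simple Monte Carlo. The following is a sketch under the assumption that the three edge weights are drawn uniformly from $[-1, 1]^3$; it reports the fraction of draws for which at least one of the six partial correlations has absolute value at most $\lambda$. Function names, sample size and seed are illustrative choices, not part of the paper.

```python
import numpy as np
from itertools import combinations

def partial_corr(Sigma, i, j, S=()):
    """corr(X_i, X_j | X_S) from a covariance matrix (0-based indices)."""
    idx = [i, j] + list(S)
    K = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

# All six (i, j, S) for three nodes: S is empty or the single remaining node 3 - i - j.
TRIPLES = [(i, j, S) for i, j in combinations(range(3), 2) for S in ((), (3 - i - j,))]

def fraction_not_strong_faithful(lam, n_samples=2000, seed=0):
    """Monte Carlo estimate of the relative volume of non-lam-strong-faithful parameters."""
    rng = np.random.default_rng(seed)
    I = np.eye(3)
    hits = 0
    for _ in range(n_samples):
        A = np.zeros((3, 3))
        A[0, 1], A[0, 2], A[1, 2] = rng.uniform(-1.0, 1.0, size=3)
        Sigma = np.linalg.inv((I - A) @ (I - A).T)
        hits += any(abs(partial_corr(Sigma, i, j, S)) <= lam for i, j, S in TRIPLES)
    return hits / n_samples
```

Calling `fraction_not_strong_faithful(0.01)` and `fraction_not_strong_faithful(0.1)` with the same seed uses the same parameter draws, so the second estimate can never be smaller than the first.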


3. General problem setup. Consider a DAG $G$. Without loss of generality, we assume that the vertices of $G$ are topologically ordered, meaning that $i < j$ for all $(i, j) \in E$. Each node $i$ in the graph is associated with a random variable $X_i$. Given a DAG $G$, the random variables $X_i$ are related to each other by the following structural equations:

$$X_j = \sum_{i < j} a_{ij} X_i + \varepsilon_j, \qquad j = 1, 2, \ldots, p, \tag{7}$$

where $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p) \sim \mathcal{N}(0, I)$ (see footnote 2) and $a_{ij} \in [-1, +1]$ are the causal parameters with $a_{ij} \neq 0$ if and only if $(i, j) \in E$. As we will see later, we can easily generalize our results to a rescaling of the parameter cube. In matrix form, these equations can be expressed as

$$(I - A)^T X = \varepsilon,$$

where $X = (X_1, X_2, \ldots, X_p)^T$ and $A \in \mathbb{R}^{p \times p}$ is an upper triangular matrix with $A_{ij} = a_{ij}$ for $i < j$. Since $\varepsilon \sim \mathcal{N}(0, I)$,

$$X \sim \mathcal{N}\bigl(0, [(I - A)(I - A)^T]^{-1}\bigr). \tag{8}$$

We will exploit the distributional form (8) for bounding the volume of the sets $(a_{ij})_{(i,j) \in E} \in [-1, +1]^{|E|}$ that correspond to Gaussian distributions that are not (restricted) $\lambda$-strong-faithful.
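The distributional form (8) is easy to sanity-check by simulation. The sketch below computes the covariance matrix from a weight matrix $A$ and compares it with the sample covariance of draws obtained by solving $(I - A)^T X = \varepsilon$. The 4-node graph, its weights and the function names are hypothetical choices for illustration; nodes are 0-based.

```python
import numpy as np

def covariance_from_weights(A):
    """Covariance in (8): inverse of (I - A)(I - A)^T for a strictly upper triangular A."""
    I = np.eye(A.shape[0])
    return np.linalg.inv((I - A) @ (I - A).T)

def simulate(A, n, seed=0):
    """Draw n samples from model (7) by solving (I - A)^T X = eps for each noise column."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    eps = rng.standard_normal((p, n))
    return np.linalg.solve((np.eye(p) - A).T, eps)   # array of shape (p, n)

# Hypothetical DAG with edges 0->1, 0->2, 1->3, 2->3 (topologically ordered).
A = np.zeros((4, 4))
A[0, 1], A[0, 2], A[1, 3], A[2, 3] = 0.9, -0.4, 0.6, 0.7

Sigma = covariance_from_weights(A)
X = simulate(A, n=200_000)
max_error = np.abs(np.cov(X) - Sigma).max()   # largest entrywise sampling deviation
```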

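Given $(i, j) \in V \times V$ with $i \neq j$ and $S \subset V \setminus \{i, j\}$, we define the set

$$\mathcal{P}^{\lambda}_{ij \mid S} := \Bigl\{(a_{st}) \in [-1, +1]^{|E|} \;\Big|\; |\operatorname{cov}(X_i, X_j \mid X_S)| \le \lambda \sqrt{\operatorname{var}(X_i \mid X_S)\operatorname{var}(X_j \mid X_S)}\Bigr\}.$$

The set of parameters corresponding to distributions that are not $\lambda$-strong-faithful is

$$\mathcal{M}_{G,\lambda} := \bigcup_{\substack{i, j \in V,\ S \subset V \setminus \{i, j\}:\\ j \text{ not d-separated from } i \mid S}} \mathcal{P}^{\lambda}_{ij \mid S}.$$

The set of parameters corresponding to distributions that are not restricted $\lambda$-strong-faithful is given by

$$\mathcal{M}^{(1)}_{G,\lambda} := \bigcup_{\substack{i, j \in V,\ S \subset V \setminus \{i, j\}:\\ (i, j, S) \in N^{(1)}_G}} \mathcal{P}^{\lambda}_{ij \mid S},$$

where $N^{(1)}_G$ denotes the set of triples $(i, j, S)$, $S \subset V \setminus \{i, j\}$ with $|S| \le \deg(G)$, satisfying either $(i, j) \in E$ or $i, j$ are not d-separated given $S$ and not adjacent but there exists $k \in V$ making $(i, j, k)$ an unshielded triple. The set of parameters corresponding to distributions that are not $\lambda$-adjacency-faithful [see part (i) of Definition 1.4] is given by

$$\mathcal{M}^{(2)}_{G,\lambda} := \bigcup_{\substack{i, j \in V,\ S \subset V \setminus \{i, j\}:\\ (i, j, S) \in N^{(2)}_G}} \mathcal{P}^{\lambda}_{ij \mid S},$$

where $N^{(2)}_G$ denotes the set of triples $(i, j, S)$, $S \subset V \setminus \{i, j\}$ with $|S| \le \deg(G)$, satisfying $(i, j) \in E$.

Our goal is to provide upper and lower bounds on the volume of $\mathcal{M}_{G,\lambda}$, $\mathcal{M}^{(1)}_{G,\lambda}$ and $\mathcal{M}^{(2)}_{G,\lambda}$ relative to the volume of $[-1, +1]^{|E|}$, that is, to provide upper and lower bounds for

$$\frac{\operatorname{vol}(\mathcal{M}_{G,\lambda})}{2^{|E|}} \quad \text{and} \quad \frac{\operatorname{vol}(\mathcal{M}^{(1)}_{G,\lambda})}{2^{|E|}} \quad \text{and} \quad \frac{\operatorname{vol}(\mathcal{M}^{(2)}_{G,\lambda})}{2^{|E|}}.$$

This is the probability mass of $\mathcal{M}_{G,\lambda}$, $\mathcal{M}^{(1)}_{G,\lambda}$ and $\mathcal{M}^{(2)}_{G,\lambda}$ if the parameters $(a_{ij})_{(i,j) \in E}$ are distributed uniformly in $[-1, +1]^{|E|}$, which we will assume throughout the paper.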
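Because the relative volume is a probability under uniform edge weights, it can be approximated by sampling. The sketch below estimates the proportion of parameters that are not $\lambda$-adjacency-faithful, which needs only the edge set and conditioning sets of size at most $\deg(G)$ and therefore no d-separation queries. It is a minimal illustration under stated assumptions: nodes are labelled $0, \ldots, p-1$ in topological order, and the function names, sample size and seed are arbitrary choices.

```python
import numpy as np
from itertools import combinations

def partial_corr(Sigma, i, j, S=()):
    """corr(X_i, X_j | X_S) from a covariance matrix (0-based indices)."""
    idx = [i, j] + list(S)
    K = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

def adjacency_unfaithful_fraction(p, edges, lam, n_samples=1000, seed=0):
    """Monte Carlo estimate of the relative volume of non-lam-adjacency-faithful parameters."""
    rng = np.random.default_rng(seed)
    degree = np.zeros(p, dtype=int)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    max_deg = int(degree.max())               # deg(G): maximal in-degree plus out-degree
    I = np.eye(p)
    hits = 0
    for _ in range(n_samples):
        A = np.zeros((p, p))
        for i, j in edges:
            A[i, j] = rng.uniform(-1.0, 1.0)  # uniform edge weights on [-1, 1]
        Sigma = np.linalg.inv((I - A) @ (I - A).T)
        # A draw counts if some edge has a small partial correlation for some admissible S.
        hits += any(
            abs(partial_corr(Sigma, i, j, S)) <= lam
            for i, j in edges
            for r in range(max_deg + 1)
            for S in combinations([k for k in range(p) if k not in (i, j)], r)
        )
    return hits / n_samples

chain = [(0, 1), (1, 2), (2, 3)]              # hypothetical 4-node chain
```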

4. Algebraic description of unfaithful distributions. In this section, we first explain that the unfaithful distributions can always be described by polynomials in the causal parameters $(a_{ij})_{(i,j) \in E}$ and therefore correspond to a collection of hypersurfaces in the hypercube $[-1, +1]^{|E|}$. We then give a combinatorial description of these defining polynomials in terms of paths in the underlying graph. The proofs can be found in Section 8.

Proposition 4.1. Let $i, j \in V$, $S \subset V \setminus \{i, j\}$ and $Q = S \cup \{i, j\}$. All CI relations in model (7) can be formulated as polynomial equations in the entries of the concentration matrix $K = (I - A)(I - A)^T$, namely,

(i) $X_i \perp X_j \iff (C(K))_{ij} = 0$,

(ii) $X_i \perp X_j \mid X_{V \setminus \{i, j\}} \iff K_{ij} = 0$,

(iii) $X_i \perp X_j \mid X_S \iff \det(K_{Q^c Q^c}) K_{ij} - K_{i Q^c}\, C(K_{Q^c Q^c})\, K_{Q^c j} = 0$,

where $C(B)$ denotes the cofactor matrix of $B$.³
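Part (iii) can be checked numerically: the polynomial equals $\det(K_{Q^c Q^c})$ times the $(i, j)$ entry of the concentration matrix of the marginal distribution of $X_Q$, so it vanishes exactly when the conditional covariance does. The sketch below verifies this identity, together with part (i), for randomly drawn weights on a fully connected DAG. The helper names and the random instance are assumptions of this example; indices are 0-based.

```python
import numpy as np

def cofactor_matrix(B):
    """Cofactor matrix C(B) = det(B) * inv(B)^T of an invertible matrix B."""
    return np.linalg.det(B) * np.linalg.inv(B).T

def ci_polynomial(K, i, j, S):
    """det(K_{Q^c Q^c}) K_ij - K_{i Q^c} C(K_{Q^c Q^c}) K_{Q^c j}, with Q = S united with {i, j}."""
    p = K.shape[0]
    Qc = [k for k in range(p) if k not in set(S) | {i, j}]
    if not Qc:                                 # conditioning on all other variables: case (ii)
        return K[i, j]
    KQc = K[np.ix_(Qc, Qc)]
    return np.linalg.det(KQc) * K[i, j] - K[i, Qc] @ cofactor_matrix(KQc) @ K[Qc, j]

rng = np.random.default_rng(0)
p = 5
A = np.triu(rng.uniform(-1.0, 1.0, size=(p, p)), k=1)   # random fully connected DAG weights
K = (np.eye(p) - A) @ (np.eye(p) - A).T
Sigma = np.linalg.inv(K)

i, j, S = 0, 1, [2]
Q = [i, j] + S
Qc = [k for k in range(p) if k not in Q]
W = np.linalg.inv(Sigma[np.ix_(Q, Q)])         # concentration matrix of the marginal of X_Q
lhs = ci_polynomial(K, i, j, S)
rhs = np.linalg.det(K[np.ix_(Qc, Qc)]) * W[0, 1]
```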

We now give an interpretation of the polynomials defining the hypersurfaces corresponding to unfaithful distributions in directed Gaussian graphical models as paths in the skeleton of $G$. The concentration matrix $K$ can be expanded as follows:

$$K = (I - A)(I - A)^T = I - A - A^T + AA^T.$$

This decomposition shows that the entry $K_{ij}$, $i \neq j$, corresponds to the sum of all paths from $i$ to $j$ which lead over a collider $k$ minus the direct path from $i$ to $j$ if $j$ is a child of $i$, that is,

$$K_{ij} = \sum_{k:\, i \to k \leftarrow j} a_{ik} a_{jk} - a_{ij}. \tag{9}$$
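Equation (9) can be confirmed entry by entry. The sketch below compares the matrix product $(I - A)(I - A)^T$ with the collider-path sum for every pair $i < j$ of a fully connected DAG whose weights are drawn at random. The random instance and the function name are illustrative assumptions; indices are 0-based and topologically ordered, so common children of $i$ and $j$ have indices larger than $j$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = np.triu(rng.uniform(-1.0, 1.0, size=(p, p)), k=1)   # A[i, j] = a_ij for i < j
K = (np.eye(p) - A) @ (np.eye(p) - A).T

def K_entry_from_paths(A, i, j):
    """Right-hand side of (9) for i < j: collider paths i -> k <- j minus the edge i -> j."""
    p = A.shape[0]
    colliders = sum(A[i, k] * A[j, k] for k in range(j + 1, p))
    return colliders - A[i, j]

checks = [
    np.isclose(K[i, j], K_entry_from_paths(A, i, j))
    for i in range(p)
    for j in range(i + 1, p)
]
```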


³The $(i, j)$th cofactor is defined as $C(K)_{ij} = (-1)^{i+j} M_{ij}$, where $M_{ij}$ is the $(i, j)$th minor of $K$, that is, $M_{ij} = \det(K(-i, -j))$, where $K(-i, -j)$ is the submatrix of $K$ obtained by removing the $i$th row and $j$th column of $K$.

Note that $a_{ij}$ is zero in the case that $j$ is not a child of $i$.

For the covariance matrix $\Sigma = K^{-1}$, the equivalent result describing the path interpretation is given in [14], equation (1), namely

$$\Sigma = \sum_{k=0}^{2p-2} \ \sum_{\substack{r + s = k\\ r, s \le p - 1}} (A^T)^r A^s. \tag{10}$$

We give a proof using Neumann power series in Section 8.

Equation (10) shows that the $(i, j)$th entry of $\Sigma$ corresponds to all paths from $i$ to $j$ which first go backwards until they reach some vertex $k$ and then forwards to $j$. Such paths are called treks in [14]. In other words, $\Sigma_{ij}$ corresponds to all collider-free paths from $i$ to $j$.
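The trek expansion (10) is a finite sum because $A$ is strictly upper triangular and hence nilpotent ($A^p = 0$). The sketch below evaluates the double sum over $r, s \le p - 1$ directly and compares it with the inverse of the concentration matrix. The random weights and the function name are assumptions made for illustration.

```python
import numpy as np
from numpy.linalg import matrix_power

def covariance_from_treks(A):
    """Sum of (A^T)^r A^s over 0 <= r, s <= p - 1, i.e. the right-hand side of (10)."""
    p = A.shape[0]
    Sigma = np.zeros((p, p))
    for r in range(p):
        for s in range(p):
            Sigma += matrix_power(A.T, r) @ matrix_power(A, s)
    return Sigma

rng = np.random.default_rng(2)
p = 5
A = np.triu(rng.uniform(-1.0, 1.0, size=(p, p)), k=1)   # strictly upper triangular weights
K = (np.eye(p) - A) @ (np.eye(p) - A).T
Sigma_treks = covariance_from_treks(A)
```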

We now understand the covariance between two variables $X_i$ and $X_j$ and the conditional covariance when conditioning on all remaining variables in terms of paths from $i$ to $j$. In the following, we will extend these results to conditional covariances between $X_i$ and $X_j$ when conditioning on a subset $S \subset V \setminus \{i, j\}$. This means that we need to find a path description of

$$P_{ij \mid S} := \det(K_{Q^c Q^c}) K_{ij} - K_{i Q^c}\, C(K_{Q^c Q^c})\, K_{Q^c j} \tag{11}$$

[see Proposition 4.1(iii)] and therefore of the determinant and the cofactors of $K_{Q^c Q^c}$.

Ponstein [10] gave a beautiful path description of $\det(\lambda I - M)$ and the cofactors of $\lambda I - M$, where $M$ denotes a variable adjacency matrix of a not necessarily acyclic directed graph. Since $K = I - (A + A^T - AA^T)$, replacing $M$ by $A + A^T - AA^T$ and setting $\lambda = 1$, that is, symmetrizing the graph and reweighting the directed edges, allows us to apply Ponstein's theorem.

Ponstein's theorem. Let $i, j \in V$, $S \subset V \setminus \{i, j\}$ and $Q = S \cup \{i, j\}$, and let $\tilde{G}$ denote the weighted directed graph corresponding to the adjacency matrix $A + A^T - AA^T$ and $\tilde{G}_{Q^c}$ the subgraph resulting from restricting $\tilde{G}$ to the vertices in $Q^c$. Then:

(i) $\det(K_{Q^c Q^c}) = 1 + \sum_{k \ge 1} \ \sum_{m_1 + \cdots + m_s = k} (-1)^s \mu(c_{m_1}) \cdots \mu(c_{m_s})$,

(ii) $(C(K_{Q^c Q^c}))_{ij} = \sum_{k \ge 2} \ \sum_{m_0 + \cdots + m_s = k - 1} (-1)^s \mu(d_{m_0}) \mu(c_{m_1}) \cdots \mu(c_{m_s})$ for $i \neq j$,

where $\mu(d_{m_0})$ denotes the product of the edge weights along a self-avoiding path from $i$ to $j$ in $\tilde{G}_{Q^c}$ of length $m_0$, $\mu(c_{m_1}), \ldots, \mu(c_{m_s})$ denote the product of the edge weights along self-avoiding cycles in $\tilde{G}_{Q^c}$ of lengths $m_1, \ldots, m_s$, respectively, and $d_{m_0}, c_{m_1}, \ldots, c_{m_s}$ are disjoint paths.

Putting together the various pieces in (11), namely equation (9) for describing $K_{ij}$, $K_{i Q^c}$ and $K_{Q^c j}$, and Ponstein's theorem for $\det(K_{Q^c Q^c})$ and $C(K_{Q^c Q^c})$, we get a path interpretation of all partial correlations.

Fig. 3. Directed tree, cycle and bipartite graph.

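Example 4.2. For the special case where the underlying DAG is fully connected and we condition on all but one variable, that is, $S = V \setminus \{i, j, s\}$, the representation of the conditional correlation between $X_i$ and $X_j$ when conditioning on $X_S$ in terms of paths in $G$ is given by

$$\Bigl(1 + \sum_{k:\, s \to k} a_{sk}^2\Bigr)\Bigl(\sum_{k:\, i \to k \leftarrow j} a_{ik} a_{jk} - a_{ij}\Bigr) - \Bigl(\sum_{t:\, i \to t \leftarrow s} a_{it} a_{st} - a_{is} - a_{si}\Bigr)\Bigl(\sum_{t:\, j \to t \leftarrow s} a_{jt} a_{st} - a_{js} - a_{sj}\Bigr),$$

which is the polynomial $P_{ij \mid S} = K_{ss} K_{ij} - K_{is} K_{sj}$ obtained from (11) with $Q^c = \{s\}$, where $a_{uv} = 0$ whenever $v$ is not a child of $u$.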

In the following, we apply equations (9), (10) and Ponstein's theorem to describe the structure of the polynomials corresponding to unfaithful distributions for various classes of DAGs, namely DAGs whose skeletons are trees, cycles and bipartite graphs. We denote by $T_p$ a directed connected rooted tree on $p$ nodes, where all edges are directed away from the root as shown in Figure 3(a). Let $C_p$ denote a DAG whose skeleton is a cycle, and $K_{2,p-2}$ a DAG whose skeleton is a bipartite graph, where the edges are directed as shown in Figure 3(b) and (c).

We denote by $\mathrm{SOS}(a)$ a sum of squares polynomial in the variables $(a_{ij})_{(i,j) \in E}$, meaning

$$\mathrm{SOS}(a) = \sum_k f_k^2(a),$$

where each $f_k(a)$ is a polynomial in $(a_{ij})_{(i,j) \in E}$. The polynomials corresponding to unfaithful distributions for the graphs described in Figure 3 are given in the following result.

Corollary 4.3. Let $i, j \in V$ and $S \subset V \setminus \{i, j\}$ such that $i, j$ are not d-separated given $S$. Then the polynomials $P_{ij \mid S}$ defined in (11) corresponding to the CI relation $X_i \perp X_j \mid X_S$ in model (7) are of the following form:

(a) for $G = T_p$:

$$a_{i \to j} \cdot (1 + \mathrm{SOS}(a)),$$

where $a_{i \to j}$ is a monomial and denotes the value of the unique path from $i$ to $j$;

(b) for $G = C_p$:

$$a_{i \to j} \cdot (1 + \mathrm{SOS}(a)) \quad \text{if } p \notin S,$$
$$f(\bar{a})\, a_{i,i+1} - g(\bar{a})\, a_{j,j+1} \quad \text{if } S = \{p\},$$

where $a_{i \to j}$ denotes the value of a path from $i$ to $j$ and $f(\bar{a})$, $g(\bar{a})$ are polynomials in the variables $\bar{a} = \{a_{st} \mid (s, t) \notin \{(i, i+1), (j, j+1)\}\}$;

(c) for $G = K_{2,p-2}$:

$$a_{i \to j} \cdot (1 + \mathrm{SOS}(a)) \quad \text{if } p \notin S,$$
$$f(\bar{a})\, a_{1j} - g(\bar{a})\, a_{jp} \quad \text{if } i = 1 \text{ and } p \in S.$$


5. Bounds on the volume of unfaithful distributions. Based on the path interpretation of the partial covariances explained in the previous section, we derive upper and lower bounds on the volume of the parameters that lead to $\lambda$-strong-unfaithful distributions. We also provide bounds on the proportion of restricted $\lambda$-strong-unfaithful distributions. These are distributions which do not satisfy the necessary conditions for uniform or high-dimensional consistency of the PC-algorithm. Our first result makes use of Crofton's formula for real algebraic hypersurfaces and the Lojasiewicz inequality to provide a general upper bound on the measure of strong-unfaithful distributions.

Crofton's formula gives an upper bound on the surface area of a real algebraic hypersurface defined by a degree $d$ polynomial, namely:

Crofton's formula. The volume of a degree $d$ real algebraic hypersurface in the unit $m$-ball is bounded above by $C(m)\, d$, where $C(m)$ is a constant depending only on the dimension $m$.

For more details on Crofton's formula for real algebraic hypersurfaces see, for example, [2] or [4], pages 45 and 46.

The Lojasiewicz inequality gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. This is used as an upper bound for the thickness of the fattened hypersurface.

Lojasiewicz inequality. Let $f : \mathbb{R}^m \to \mathbb{R}$ be a real-analytic function and $K \subset \mathbb{R}^m$ compact. Let $V_f \subset K$ denote the real zero locus of $f$, which is assumed to be nonempty. Then there exist positive constants $c, k$ such that for all $x \in K$:

$$\operatorname{dist}(x, V_f) \le c\, |f(x)|^k.$$
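A one-dimensional worked example may help to see where the exponent comes from. The functions below are illustrative choices that do not appear in the paper.

```latex
% Illustrative choice: f(x) = x^2 on the compact set K = [-1, 1], so V_f = \{0\}.
\[
  \operatorname{dist}(x, V_f) = |x| = |x^2|^{1/2} = |f(x)|^{1/2}
  \qquad \text{for all } x \in [-1, 1],
\]
% so the inequality holds with c = 1 and k = 1/2.
% For the linear function f(x) = x it holds with c = 1 and k = 1.
```

The exponent $k$ thus reflects how flatly $f$ vanishes on its zero locus: when $|f(x)| \le \delta$ with $\delta < 1$, the bound $c\,\delta^{1/2}$ allows a wider neighbourhood of $V_f$ than $c\,\delta$ does.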
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Theorem  5.1  (General  upper  bound).  Let  G  =  (V,  E)  be  a  DAG  on  p  nodes. 
Then 


^  vol(AC^^_[)  ^  voI(Mga) 
2\e\  -  21^1  “  21^1 


< 


C{\E\)ck^X'^ 

21^72 


i:  i:  deg(cov(Z,,  Xj  I  X5)), 

ijev  scv\{i,i] 


where $C(|E|)$ is a positive constant coming from Crofton’s formula, $c, k$ are positive constants depending on the polynomials characterizing exact unfaithfulness (for an exact definition, see the proof), and $\kappa$ denotes the maximal partial variance over all possible parameter values $(a_{st}) \in [-1, 1]^{|E|}$, that is,

$$\kappa = \max_{i,j \in V,\ S \subset V \setminus \{i,j\}}\ \max_{(a_{st}) \in [-1,1]^{|E|}} \operatorname{var}(X_i \mid X_S).$$


Theorem 5.1 shows that the volume of (restricted) $\lambda$-strong-unfaithful distributions may be large for two reasons. First, the number of polynomials grows quickly as the size and density of the graph increase, and second, the degree of the polynomials grows as the number of nodes and the density of the graph increase. The higher the degree, the greater the curvature of the variety and hence the larger the volume that is filled according to Crofton’s formula. Unfortunately, the upper bound cannot be computed explicitly, since we do not have bounds on the constants in the Lojasiewicz inequality.

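Proof of Theorem 5.1. It is clear that

$$\operatorname{vol}(\mathcal{N}^{(2)}_{G,\lambda}) \le \operatorname{vol}(\mathcal{N}^{(1)}_{G,\lambda}) \le \operatorname{vol}(\mathcal{M}_{G,\lambda}).$$

Using the standard union bound, we get that

$$\operatorname{vol}(\mathcal{M}_{G,\lambda}) \le \sum_{\substack{i,j \in V,\ S \subset V \setminus \{i,j\}:\\ i \text{ not d-separated from } j \mid S}} \operatorname{vol}(\mathcal{P}^{\lambda}_{ij|S}).$$

Let $V_{ij|S}$ denote the real algebraic hypersurface defined by $\operatorname{cov}(X_i, X_j \mid X_S)$, that is, the set of all parameter values $(a_{st}) \in [-1, +1]^{|E|}$ at which $\operatorname{cov}(X_i, X_j \mid X_S)$ vanishes. Hence,

$$\operatorname{vol}(\mathcal{P}^{\lambda}_{ij|S}) \le \operatorname{vol}\bigl(\{(a_{st}) \in [-1,+1]^{|E|} \mid |\operatorname{cov}(X_i, X_j \mid X_S)| \le \lambda\kappa\}\bigr) \le \operatorname{vol}\bigl(\{(a_{st}) \in [-1,+1]^{|E|} \mid \operatorname{dist}((a_{st}), V_{ij|S}) \le c_{ij|S}\,\lambda^{k_{ij|S}}\kappa^{k_{ij|S}}\}\bigr),$$

where $c_{ij|S}, k_{ij|S}$ are positive constants and the second inequality follows from the Lojasiewicz inequality.

We apply Crofton’s formula on an $|E|$-dimensional ball of radius $\sqrt{2}$ to get an upper bound on the surface area of a real algebraic hypersurface in the hypercube $[-1, 1]^{|E|}$:

$$\operatorname{vol}(\mathcal{P}^{\lambda}_{ij|S}) \le c_{ij|S}\,\lambda^{k_{ij|S}}\kappa^{k_{ij|S}}\,2^{|E|/2}\,C(|E|)\,\deg\bigl(\operatorname{cov}(X_i, X_j \mid X_S)\bigr).$$

The claim follows by setting

$$c = \max_{i,j \in V,\ S \subset V \setminus \{i,j\}} c_{ij|S} \quad\text{and}\quad k = \min_{i,j \in V,\ S \subset V \setminus \{i,j\}} k_{ij|S}.$$

□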


The PC-algorithm in practice only requires $\lambda$-strong-faithfulness for all subsets $S \subset V \setminus \{i, j\}$ for which $|S|$ is at most the maximal degree of the graph. This could lead to a tighter upper bound, since we have fewer summands. We will analyze in Section 6 how helpful this is in practice. In addition, note that we can easily get upper bounds for a general parameter cube $[-r, r]^{|E|}$ by applying Crofton’s formula to a sphere of radius $\sqrt{2}\,r$.
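To give a feeling for how quickly the number of summands in the union bound grows, and how much is saved by restricting the size of the conditioning sets, the following sketch counts the triples $(i, j, S)$. It counts unordered pairs and ignores d-separation, so it is only an upper bound on the number of summands; the function names and the cap `max_size` are illustrative and not part of the paper.

```python
from math import comb


def num_triples(p: int) -> int:
    """Number of (unordered pair {i, j}, conditioning set S) combinations on p nodes."""
    return comb(p, 2) * 2 ** (p - 2)


def num_triples_restricted(p: int, max_size: int) -> int:
    """Same count when only conditioning sets with |S| <= max_size are used."""
    largest = min(max_size, p - 2)
    return comb(p, 2) * sum(comb(p - 2, k) for k in range(largest + 1))


# Compare the full count with the count for conditioning sets of size at most 3.
for p in (5, 10, 20):
    print(p, num_triples(p), num_triples_restricted(p, 3))
```

The full count grows like $2^{p}$, whereas the restricted count is polynomial in $p$ for a fixed cap on $|S|$.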

Since the main goal of this paper is to show how restrictive the (restricted) strong-faithfulness assumption is, lower bounds on the proportion of (restricted) $\lambda$-strong-unfaithful distributions are necessary. However, nontrivial lower bounds for general graphs cannot be found using tools from real algebraic geometry, since in the worst case the surface area of a real algebraic hypersurface is zero. This is the case when the polynomial defining the hypersurface has no real roots. In that case, the corresponding real algebraic hypersurface is empty. As a consequence, we need to analyze different classes of graphs separately, understand the defining polynomials, and find lower bounds for these classes of graphs. In Section 4, we discussed the structure of the defining polynomials for DAGs whose skeletons are trees, cycles or bipartite graphs, respectively. In the following, we use these results to find lower bounds on the proportion of (restricted) $\lambda$-strong-unfaithful distributions for these classes of graphs.


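Theorem 5.2 (Lower bound for trees). Let $T_p$ be a connected directed tree on $p$ nodes with edge set $E$ as shown in Figure 3(a). Then:

(i) $\dfrac{\operatorname{vol}(\mathcal{M}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$,

(ii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(1)}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$,

(iii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(2)}_{T_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p-1}$.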

Theorem 5.2 shows that the measure of restricted and ordinary $\lambda$-strong-unfaithful distributions converges to 1 exponentially in the number $p$ of nodes for fixed $\lambda \in (0, 1)$. Hence, even for trees the strong-faithfulness assumption is restrictive and the use of the PC-algorithm problematic when the number of nodes is large.

Proof of Theorem 5.2. (i) For a given pair of nodes $i, j \in V$, $i \ne j$, and subset $S \subset V \setminus \{i, j\}$ we want to lower bound the volume of parameters $(a_{st}) \in [-1, 1]^{|E|}$ (in this example $|E| = p - 1$) for which

$$|\operatorname{cov}(X_i, X_j \mid X_S)| \le \lambda \sqrt{\operatorname{var}(X_i \mid X_S)\,\operatorname{var}(X_j \mid X_S)}$$

or equivalently

$$|P_{ij|S}| \le \lambda \sqrt{P_{ii|S}\,P_{jj|S}}.$$

From Corollary 4.3, we know that the defining polynomials $P_{ij|S}$ for $T_p$ are of the form

$$a_{i \to j} \cdot (1 + \operatorname{SOS}(a)).$$

Similarly as in Corollary 4.3, one can prove that the polynomials $P_{ii|S}$ are of the form $1 + \operatorname{SOS}(a)$ and can therefore be lower bounded by 1.

So the hypersurfaces representing the unfaithful distributions are the coordinate planes corresponding to the $p - 1$ edges in the tree $T_p$. A distribution is strong-unfaithful if it is near to any one of the hypersurfaces (worst case). Since there is a defining polynomial $P_{ij|S}$ without the factor consisting of the sum of squares, the $\lambda$-strong-unfaithful distributions correspond to the parameter values $(a_{st}) \in [-1, 1]^{p-1}$ satisfying

$$|a_{i \to j}| \le \lambda$$

for at least one pair of $i, j \in V$. Since we are seeking a lower bound, we set all parameter values to 1 except for one. As a result, a lower bound on the proportion of $\lambda$-strong-unfaithful distributions is given by the union of all parameter values $(a_{st}) \in [-1, 1]^{p-1}$ such that

$$|a_{st}| \le \lambda$$

for some edge $(s, t) \in E$. We get a lower bound on the volume by an inclusion-exclusion argument. We first sum the volumes of all coordinate hyperplanes thickened by $2\lambda$, subtract all pairwise intersections, add all three-wise intersections, and so on. This results in the following lower bound:

$$\frac{\operatorname{vol}(\mathcal{M}_{T_p,\lambda})}{2^{|E|}} \ge (p-1)\,\frac{2\lambda\,2^{p-2}}{2^{p-1}} - \binom{p-1}{2}\frac{(2\lambda)^2\,2^{p-3}}{2^{p-1}} + \cdots = 1 - (1 - \lambda)^{p-1}.$$

The proof of (ii) and (iii) is similar. The monomials $a_{i \to j}$ reduce to single parameters $a_{ij}$, since the necessary conditions only involve $(i, j) \in E$. □
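The inclusion-exclusion step in part (i) can be checked numerically. The sketch below sums the alternating series over $k$-fold intersections of thickened coordinate hyperplanes in $[-1, 1]^{p-1}$ and compares it with the closed form $1 - (1 - \lambda)^{p-1}$; the function names are illustrative.

```python
from math import comb


def inclusion_exclusion(p: int, lam: float) -> float:
    """Proportion of [-1, 1]^(p-1) with |a_st| <= lam for at least one coordinate."""
    m = p - 1  # number of edges of a tree on p nodes
    total = 0.0
    for k in range(1, m + 1):
        # There are comb(m, k) k-fold intersections of slabs,
        # each of volume (2*lam)^k * 2^(m-k).
        total += (-1) ** (k + 1) * comb(m, k) * (2 * lam) ** k * 2 ** (m - k)
    return total / 2 ** m


def closed_form(p: int, lam: float) -> float:
    """The lower bound 1 - (1 - lam)^(p-1) from the proof."""
    return 1 - (1 - lam) ** (p - 1)


for p in (2, 5, 10):
    for lam in (0.1, 0.01):
        print(p, lam, inclusion_exclusion(p, lam), closed_form(p, lam))
```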


This theorem is in line with the results in [1], where it is shown that for trees, checking whether a Gaussian distribution satisfies all conditional independence relations imposed by the Markov property only requires testing whether the causal parameters corresponding to the edges in the tree are nonzero.

Note that the behavior stated in Theorem 5.2 is qualitatively the same as for a linear model $Y = X\beta + \varepsilon$ with active set $S = \{j \mid \beta_j \ne 0\}$. To get consistent estimation of $S$, a “beta-min” condition is required, namely that for some suitable $\lambda$,

$$\min_{j \in S} |\beta_j| \ge \lambda,$$

meaning that the relative volume of the problematic set of parameter values $\beta \in [-1, 1]^{|S|}$ is given by

$$1 - (1 - \lambda)^{|S|}.$$

The cardinality $|S|$ is the analogue of the number of edges in a DAG; for trees, the number of edges is $p - 1 \asymp p$, which explains the comparable behavior for strong-faithfulness of trees and the volume of coefficients where the “beta-min” condition holds.
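The beta-min analogy can be illustrated with a small Monte Carlo experiment: coefficients are drawn uniformly from $[-1, 1]^{|S|}$ and we count how often the smallest absolute coefficient falls below $\lambda$. This is only a sketch; the sample size, the seed and the function name are arbitrary choices made here.

```python
import random


def beta_min_violation_rate(size_s: int, lam: float,
                            n_samples: int = 20000, seed: int = 0) -> float:
    """Fraction of uniform draws from [-1, 1]^size_s with min |beta_j| < lam."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        beta = [rng.uniform(-1.0, 1.0) for _ in range(size_s)]
        if min(abs(b) for b in beta) < lam:
            hits += 1
    return hits / n_samples


exact = 1 - (1 - 0.1) ** 5          # closed form for |S| = 5, lambda = 0.1
estimate = beta_min_violation_rate(5, 0.1)
print(exact, estimate)
```

The estimate should agree with the closed form up to Monte Carlo error.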

Using the lower bound computed in Theorem 5.2, we can also analyze some scaling of $n$, $p = p_n$ and $\deg(G) = \deg(G_n)$ as a function of $n$, such that $\lambda = \lambda_n$-strong-faithfulness holds. This is discussed in Section 5.1.

We now provide a lower bound for DAGs where the skeleton is a cycle on $p$ nodes.


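Theorem 5.3 (Lower bound for cycles). Let $C_p$ be a directed cycle on $p$ nodes with edge set $E$ as shown in Figure 3(b). Then:

(i) $\dfrac{\operatorname{vol}(\mathcal{M}_{C_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{p + \binom{p}{2}}$,

(ii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(1)}_{C_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{3p-2}$,

(iii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(2)}_{C_p,\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{2p-1}$.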


For cycles, the measure of $\lambda$-strong-unfaithful distributions converges to 1 exponentially in $p^2$. The addition of a single cycle significantly increases the volume of strong-unfaithful distributions. The measure of restricted $\lambda$-strong-unfaithful distributions, however, converges to 1 exponentially in $3p$ and hence shows a similar behavior as for trees. The scaling for achieving strong-faithfulness for cycles is discussed in Section 5.1.

Proof of Theorem 5.3. As for trees, all coordinate hyperplanes correspond to unfaithful distributions. The corresponding volume of strong-unfaithful distributions is $2^{p-1} \cdot (2\lambda)$ and there are $p$ such fattened hyperplanes. In addition, there are $\binom{p}{2}$ hypersurfaces in the case of (i), $2(p - 1)$ hypersurfaces for (ii), and $p - 1$ hypersurfaces for (iii) defined by polynomials of the form $f(\bar{a})\,a_{i,i+1} - g(\bar{a})\,a_{j,j+1}$, where $\bar{a} = \{a_{st} \mid (s, t) \notin \{(i, i+1), (j, j+1)\}\}$. Such hypersurfaces are equivalently defined by

$$a_{i,i+1} = \frac{g(\bar{a})}{f(\bar{a})}\,a_{j,j+1}.$$

Since for any fixed $\bar{a} \in [-1, 1]^{p-2}$ this is the parametrization of a line, we can lower bound the surface area of this hypersurface by $2^{p-2} \cdot 2$, which is the same lower bound as for a coordinate hyperplane. Similarly as in the proof for trees, an inclusion-exclusion argument over all hyperplanes yields the proof. □


Our simulations in Section 6 show that by increasing the number of cycles in the skeleton, the volume of strong-unfaithful distributions increases significantly. We now provide a lower bound for DAGs where the skeleton is a bipartite graph $K_{2,p-2}$ and therefore consists of many 4-cycles. The corresponding scaling for strong-faithfulness is discussed in Section 5.1.


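Theorem 5.4 (Lower bound for bipartite graphs). Let $K_{2,p-2}$ be a directed bipartite graph on $p$ nodes with edge set $E$ as shown in Figure 3(c). Then:

(i) $\dfrac{\operatorname{vol}(\mathcal{M}_{K_{2,p-2},\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{(p-2)(2^{p-3}+1)}$,

(ii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(1)}_{K_{2,p-2},\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{(p-2)(2^{p-3}+1)}$,

(iii) $\dfrac{\operatorname{vol}(\mathcal{N}^{(2)}_{K_{2,p-2},\lambda})}{2^{|E|}} \ge 1 - (1 - \lambda)^{(p-2)(2^{p-3}+1)}$.

Proof. The graph $K_{2,p-2}$ has $2(p - 2)$ edges leading to $2(p - 2)$ hyperplanes of surface area $2^{2(p-2)-1}$. In addition, there are $(p - 2)(2^{p-3} - 1)$ distinct hypersurfaces defined by polynomials of the form $f(\bar{a})\,a_{1,j} - g(\bar{a})\,a_{j,p}$. Their surface area can be lower bounded as well by $2^{2(p-2)-1}$, as seen in the proof of Theorem 5.3. Hence, the volume of restricted and ordinary $\lambda$-strong-unfaithful distributions on $K_{2,p-2}$ is bounded below by

$$1 - (1 - \lambda)^{2(p-2) + (p-2)(2^{p-3}-1)}.$$

□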
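The counting in the proof can be made concrete with a short sketch. It adds the number of coordinate hyperplanes and the number of additional hypersurfaces for $K_{2,p-2}$, checks that the total equals $(p-2)(2^{p-3}+1)$, and compares the resulting lower bound with the tree bound $1 - (1-\lambda)^{p-1}$ from the proof of Theorem 5.2. The function names are illustrative.

```python
def bipartite_exponent(p: int) -> int:
    """Number of hyperplanes plus hypersurfaces counted in the proof for K_{2,p-2}."""
    hyperplanes = 2 * (p - 2)                      # one per edge
    hypersurfaces = (p - 2) * (2 ** (p - 3) - 1)   # additional hypersurfaces
    return hyperplanes + hypersurfaces


def lower_bound(exponent: int, lam: float) -> float:
    """Lower bound of the form 1 - (1 - lam)^exponent."""
    return 1 - (1 - lam) ** exponent


for p in (4, 6, 8, 10):
    e = bipartite_exponent(p)
    # The sum of the two counts simplifies to (p - 2)(2^(p-3) + 1).
    assert e == (p - 2) * (2 ** (p - 3) + 1)
    print(p, e, lower_bound(e, 0.001), lower_bound(p - 1, 0.001))
```

Already for moderate $p$ the exponent for the bipartite graph is orders of magnitude larger than the exponent $p - 1$ for a tree on the same number of nodes.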


We remark that we can generalize the lower bounds to a rescaled parameter cube $[-r, r]^{|E|}$ by replacing $\lambda$ by $\lambda/r$. Notice that as $r$ increases the lower bounds decrease, but a very large value of $r$ (i.e., very large absolute values of causal parameters) would be needed to achieve sufficiently small lower bounds. Furthermore, as discussed in [7], other factors such as singularities on the partial correlation hypersurfaces may significantly increase the volume and can occur anywhere on the hypersurface depending on the structure of the DAG. Therefore, the lower bound may not be tight.
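As a worked illustration of how large $r$ would have to be, consider the tree bound $1 - (1-\lambda)^{p-1}$ on the rescaled cube $[-r, r]^{p-1}$. The tolerance $\delta$ and the numerical values below are illustrative assumptions, not taken from the paper.

```latex
% Lower bound on the proportion of strong-unfaithful parameters in [-r, r]^{p-1}:
\[
  1 - \Bigl(1 - \frac{\lambda}{r}\Bigr)^{p-1}, \qquad 0 < \lambda \le r .
\]
% This bound is at most \delta exactly when
\[
  r \;\ge\; \frac{\lambda}{1 - (1-\delta)^{1/(p-1)}}
  \;\approx\; \frac{\lambda\,(p-1)}{-\log(1-\delta)} \quad (p \text{ large}).
\]
% Example: p = 100, \lambda = 0.01, \delta = 0.1 requires r of roughly 9.4.
```

So even for a tree on 100 nodes, pushing the lower bound below 0.1 at $\lambda = 0.01$ would require causal parameters ranging over an interval almost ten times wider than $[-1, 1]$.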

5.1. Scaling and strong-faithfulness. We here consider the setting where the DAG $G = G_n$ and hence the number of nodes $p = p_n$ and the degree of the DAG $\deg(G) = \deg(G_n)$ depend on $n$, and we take an asymptotic viewpoint where $n \to \infty$. In such a setting, we focus on $\lambda = \lambda_n \asymp \sqrt{\deg(G_n)\log(p_n)/n}$ (see [5]). We now briefly discuss when (restricted) $\lambda_n$-strong-faithfulness will asymptotically hold. For the latter, we must have that the lower bounds (see Theorems 5.2-5.4) on failure of (restricted) $\lambda_n$-strong-faithfulness tend to zero.

Case I: lower bound $\asymp 1 - (1 - \lambda_n)^{p_n}$. Such lower bounds appear for trees (Theorem 5.2) as well as for restricted strong-faithfulness for cycles (Theorem 5.3). The lower bound $1 - (1 - \lambda_n)^{p_n}$ tends to zero as $n \to \infty$ if

$$p_n = o\Bigl(\sqrt{\frac{n}{\deg(G_n)\log(n)}}\Bigr) \quad (n \to \infty).$$

Thus, we have $p_n = o(\sqrt{n/\log(n)})$ for $\lambda_n$-strong-faithfulness for bounded degree trees and for restricted $\lambda_n$-strong-faithfulness for cycles, and we have $p_n = o((n/\log(n))^{1/3})$ for star-shaped graphs.

Case II: lower bound $\asymp 1 - (1 - \lambda_n)^{p_n^2}$. Such a lower bound appears for strong-faithfulness for cycles (Theorem 5.3). The lower bound $1 - (1 - \lambda_n)^{p_n^2}$ tends to zero as $n \to \infty$ if

$$p_n = o\Bigl(\Bigl(\frac{n}{\deg(G_n)\log(n)}\Bigr)^{1/4}\Bigr) \quad (n \to \infty).$$

Therefore, we have $p_n = o((n/\log(n))^{1/4})$ for $\lambda_n$-strong-faithfulness for cycles.

Case III: lower bound $\asymp 1 - (1 - \lambda_n)^{2^{p_n}}$. This lower bound appears for strong-faithfulness for bipartite graphs (Theorem 5.4). This bound tends to zero as $n \to \infty$ if

$$p_n = o(\log(n)) \quad (n \to \infty),$$

regardless of $\deg(G_n) \le p_n$. Thus, for bipartite graphs with $\deg(G_n) = p_n - 2$ we have $p_n = o(\log(n))$ for $\lambda_n$-strong-faithfulness.
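The scaling in Case I can be illustrated numerically. The sketch below sets $p_n = n^{\alpha}$ and $\lambda_n = \sqrt{\deg(G_n)\log(p_n)/n}$ and evaluates $1 - (1-\lambda_n)^{p_n}$ for growing $n$. The constant factor 1 in $\lambda_n$, the bounded degree 2 and the exponents $\alpha$ are assumptions chosen for illustration only.

```python
import math


def lambda_n(n: float, p: float, deg: float) -> float:
    """Threshold of order sqrt(deg * log(p) / n); the constant 1 is an assumption."""
    return math.sqrt(deg * math.log(p) / n)


def lower_bound(exponent: float, lam: float) -> float:
    """1 - (1 - lam)^exponent, evaluated stably for small lam."""
    if lam >= 1.0:
        return 1.0
    return -math.expm1(exponent * math.log1p(-lam))


def case_one_bound(n: float, alpha: float, deg: float = 2.0) -> float:
    """Case I bound with p_n = n^alpha and bounded degree (illustrative choice)."""
    p = n ** alpha
    return lower_bound(p, lambda_n(n, p, deg))


for alpha in (0.4, 0.6):
    print(alpha, [case_one_bound(10.0 ** e, alpha) for e in (4, 8, 12)])
```

For $\alpha < 1/2$ the bound decreases (slowly) with $n$, whereas for $\alpha > 1/2$ it approaches 1, in line with the condition $p_n = o(\sqrt{n/\log(n)})$.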

In summary, even for trees, we cannot have $p_n \gg n$, and high-dimensional consistency of the PC-algorithm seems rather unrealistic (unless, e.g., the causal parameters have a distribution which is very different from uniform).

6. Simulation results. In this section, we describe various simulation results to validate the theoretical bounds described in the previous section. For our simulations, we used the R library pcalg [6].

In a first set of simulations, we generated random DAGs with a given expected neighborhood size (i.e., expected degree of each vertex in the DAG) and edge weights sampled uniformly in $[-1, 1]$. We then analyzed how the proportion of $\lambda$-strong-unfaithful distributions depends on the number of nodes $p$ and the expected neighborhood size of the graph. Depending on the number of nodes in a graph, we analyzed 5-10 different expected neighborhood sizes and generated 10,000 random DAGs for each expected neighborhood size.

Using  pcalg  we  computed  all  partial  correlations.  Since  this  computation  re¬ 
quires  multiple  matrix  inversions,  numerical  imprecision  has  to  be  expected.  We 
assumed  that  all  partial  correlations  smaller  than  10“^^  were  actual  zeroes  and 
counted  the  number  of  simulations,  for  which  the  minimal  partial  correlation  (af¬ 
ter  excluding  the  ones  with  partial  correlation  <  10“^^)  was  smaller  than  A.  The 
resulting plots of the proportion of $\lambda$-strong-unfaithful distributions for three different values of $\lambda$, namely $\lambda = 0.1, 0.01, 0.001$, are given in Figure 4(a) for $p = 3$ nodes, in Figure 4(b) for $p = 5$ nodes and in Figure 4(c) for $p = 10$ nodes.
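The following is a minimal pure-Python sketch of this kind of experiment, not the pcalg-based code used for the reported results. It assumes unit error variances, a strictly upper triangular weight matrix `A` (nodes in topological order) and the covariance $\Sigma = ((I-A)(I-A)^T)^{-1}$. The edge probability `edge_prob`, the numerical-zero tolerance `tol = 1e-12`, the number of DAGs and all function names are assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(M)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [v / d for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def covariance(A):
    """Sigma = ((I - A)(I - A)^T)^{-1} for a strictly upper triangular A."""
    p = len(A)
    B = [[float(i == j) - A[i][j] for j in range(p)] for i in range(p)]
    Bt = [list(col) for col in zip(*B)]
    return inverse(matmul(B, Bt))


def partial_corr(Sigma, i, j, S):
    """Partial correlation of X_i and X_j given X_S via the inverse of Sigma_QQ."""
    Q = [i, j] + list(S)
    K = inverse([[Sigma[a][b] for b in Q] for a in Q])
    return -K[0][1] / math.sqrt(K[0][0] * K[1][1])


def is_strong_unfaithful(A, lam, tol=1e-12):
    """True if some partial correlation lies strictly between tol and lam (absolute value)."""
    p = len(A)
    Sigma = covariance(A)
    for i, j in combinations(range(p), 2):
        rest = [v for v in range(p) if v not in (i, j)]
        for size in range(len(rest) + 1):
            for S in combinations(rest, size):
                if tol < abs(partial_corr(Sigma, i, j, S)) < lam:
                    return True
    return False


def random_weights(p, edge_prob, rng):
    """Random DAG in topological order with Uniform[-1, 1] edge weights."""
    return [[rng.uniform(-1.0, 1.0) if i < j and rng.random() < edge_prob else 0.0
             for j in range(p)] for i in range(p)]


def proportion(p, edge_prob, lam, n_dags=200, seed=0):
    """Monte Carlo estimate of the proportion of lam-strong-unfaithful parameter draws."""
    rng = random.Random(seed)
    hits = sum(is_strong_unfaithful(random_weights(p, edge_prob, rng), lam)
               for _ in range(n_dags))
    return hits / n_dags


print(proportion(4, 0.5, 0.1), proportion(4, 0.5, 0.01))
```

Enumerating all conditioning sets is exponential in $p$, so this sketch is only practical for small graphs.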

It appears that already for very sparse graphs (i.e., expected neighborhood size of 2) and relatively small graphs (i.e., 10 nodes) the proportion of $\lambda$-strong-unfaithful distributions is nearly 1 for $\lambda = 0.1$, about 0.9 for $\lambda = 0.01$ and about 0.7 for $\lambda = 0.001$. In addition, the proportion of $\lambda$-strong-unfaithful distributions increases with graph density and with the number of nodes (even for a fixed expected neighborhood size). The general upper bound derived in Theorem 5.1 shows similar behavior: the number of summands and the degrees of the hypersurfaces grow with the number of nodes and graph density.

6.1. Bounding the causal parameters away from zero. In the following, we analyze how the proportion of $\lambda$-strong-unfaithful distributions changes when restricting the parameter space. The motivation behind this experiment is that unfaithfulness would not be too serious of an issue if the PC-algorithm only fails to recover very small causal effects but does well when the causal parameters are large. We repeated the experiments when restricting the parameter space to

$$[-1, -c] \cup [c, 1]$$

for $c = 0.25$, $0.5$ and $0.75$. The results for 10-node DAGs are shown in Figure 5. Restricting the parameter space seems to help for sparse graphs but does not seem to play a role for dense graphs. We now analyze various classes of graphs and their behavior when restricting the parameter space.

Fig. 4. Proportion of $\lambda$-strong-unfaithful distributions for 3 values of $\lambda$. Panels: (a) 3-node DAGs, (b) 5-node DAGs, (c) 10-node DAGs.

Fig. 5. Proportion of $\lambda$-strong-unfaithful distributions for 10-node DAGs when restricting the parameter space. Panels: (a) $c = 0.25$, (b) $c = 0.50$, (c) $c = 0.75$; horizontal axis: expected neighborhood size.

6.1.1. Trees. We generated connected trees where all edges are directed away from the root by first sampling the number of levels uniformly from $\{2, \ldots, p\}$ (a tree with 2 levels is a star graph, a tree with $p$ levels is a line), then distributing the $p$ nodes on these levels such that there is at least one node on each level, and finally assigning a unique parent to each node uniformly from all nodes on the previous level. The resulting plots for the whole parameter space $[-1, 1]$ are shown in Figure 6(a). The plots when restricting the parameter space for $c = 0.25$, $0.5$ and $0.75$ are shown in Figure 7. As before, each proportion is computed from 10,000 simulations.

For trees, restricting the parameter space reduces the proportion of $\lambda$-strong-unfaithful distributions by a large amount. This can be explained by the special structure of the defining polynomials (given in Corollary 4.3). Since the defining polynomials of the partial correlation hypersurfaces are of the form $a_{i \to j} \cdot (1 + \operatorname{SOS}(a))$, the minimal possible value of these polynomials when restricting the parameter space is

$$c^{\,\text{path length from } i \text{ to } j}.$$

6.1.2. Cycles. We generated DAGs where the skeleton is a cycle and the edges are directed as shown in Figure 3(b). The edge weights were sampled uniformly from $[-1, -c] \cup [c, 1]$. The resulting plots for the whole parameter space are shown in Figure 6(b). The plots for the restricted parameter space with $c = 0.25$, $0.5$ and $0.75$ are shown in Figure 8. Again, each point corresponds to 10,000 DAGs.

Fig. 6. Proportion of $\lambda$-strong-unfaithful distributions when the skeleton is a tree, a cycle or a bipartite graph. Horizontal axes: number of vertices; cycle length.

For cycles, restricting the parameter space also reduces the proportion of $\lambda$-strong-unfaithful distributions, however not as drastically as for trees. This can again be explained by the special structure of the defining polynomials (given in Corollary 4.3). When the defining polynomials are of the form $f(\bar{a})\,a_{i,i+1} - g(\bar{a})\,a_{j,j+1}$, they might evaluate to a very small number even when the parameters themselves are large.

6.1.3. Bipartite graphs. We generated DAGs where the skeleton is a bipartite graph and the edges are directed as shown in Figure 3(c). Bipartite graphs $K_{2,p-2}$ consist of many 4-cycles. For such graphs there are many paths from one vertex to another and therefore many ways for a polynomial to cancel out, even when the parameter values are large. As a consequence, for such graphs restricting the parameter space makes hardly any difference to the proportion of $\lambda$-strong-unfaithful distributions. This becomes apparent in Figures 6(c) and 9.

Fig. 7. Proportion of $\lambda$-strong-unfaithful distributions for trees when restricting the parameter space.

Fig. 8. Proportion of $\lambda$-strong-unfaithful distributions for cycles when restricting the parameter space. Panels: (a) $c = 0.25$, (b) $c = 0.50$, (c) $c = 0.75$; horizontal axis: cycle length.

6.1.4. Lower bounds. We compare the theoretical lower bounds derived in Section 5 to the simulation results in this section for DAGs where the skeleton is a tree, a cycle or a bipartite graph when $c = 0$. We present our lower bounds together with the simulation results in Figure 10. The black lines correspond to the lower bounds, the solid line to $\lambda = 0.1$, the dashed line to $\lambda = 0.01$ and the dotted line to $\lambda = 0.001$. In particular for bipartite graphs our lower bounds approximate the simulation results very well.

6.2. Restricted $\lambda$-strong-faithfulness. As already discussed earlier, the PC-algorithm only requires the computation of all partial correlations over edges in the graph $G$ and conditioning sets $S$ of size at most $\deg(G)$. In order to analyze when the (conservative) PC-algorithm works, we repeated all our simulations when restricting the partial correlations to edges in the graph $G$ and conditioning sets $S$ of size at most $\deg(G)$, that is, part (i) of the restricted strong-faithfulness assumption in Definition 1.4, called the adjacency-faithfulness assumption. The results for general 10-node DAGs are shown in Figure 11. We see that the proportion of $\lambda$-adjacency-unfaithful distributions is slightly reduced compared to the proportion of $\lambda$-strong-unfaithful distributions shown in Figure 5, in particular for sparse graphs. For trees and bipartite graphs the proportion of restricted $\lambda$-strong-unfaithful distributions is similar to the proportion of $\lambda$-strong-unfaithful distributions shown in Figures 6, 7 and 9, whereas the behavior for cycles regarding the proportion of restricted $\lambda$-strong-unfaithful distributions is similar to trees. We omit these plots here, but remark that they nicely agree with the theoretical bounds for restricted $\lambda$-strong-faithfulness and $\lambda$-adjacency-faithfulness derived in Section 5.

Fig. 9. Proportion of $\lambda$-strong-unfaithful distributions for bipartite graphs $K_{2,p-2}$ when restricting the parameter space.

Fig. 10. Comparison of theoretical lower bounds and approximated proportion of $\lambda$-strong-unfaithful distributions for trees, cycles and bipartite graphs $K_{2,p-2}$. Panels: (a) trees, (b) cycles, (c) bipartite graphs.

7. Discussion. In this paper, we have shown that the (restricted) strong-faithfulness assumption is very restrictive, even for relatively small and sparse graphs. Furthermore, the proportion of strong-unfaithful distributions grows with the number of nodes and the number of edges. We have also analyzed the restricted strong-faithfulness assumption introduced by Spirtes and Zhang [17], a weaker condition than strong-faithfulness, which is essentially a necessary condition for uniform or high-dimensional consistency of the popular PC-algorithm and of the conservative PC-algorithm. As seen in this paper, our lower bounds on restricted strong-unfaithful distributions are similar to our bounds for strong-faithfulness, implying inconsistent estimation with the PC-algorithm for a relatively large class of DAGs.

Fig. 11. Proportion of $\lambda$-adjacency-unfaithful distributions for 10-node DAGs. Panels: (a) $c = 0$, (b) $c = 0.25$, (c) $c = 0.75$; horizontal axis: expected neighborhood size.

For trees, due to the special structure of the polynomials defining the hypersurfaces of unfaithful distributions, if the causal parameters are large, the partial correlations tend to stay away from these hypersurfaces and strong-faithfulness holds for a large proportion of distributions. However, as soon as there are cycles in the graph (even for sparse graphs), the polynomials can cancel out even for large causal parameters, and the strong-faithfulness assumption does not hold. More precisely, if the skeleton is a single cycle, our lower bounds on the proportion of restricted strong-unfaithful distributions are of the same order of magnitude as for trees. However, if the skeleton consists of multiple cycles as, for example, for bipartite graphs, the lower bounds for restricted strong-unfaithful distributions are as bad as for plain strong-unfaithful distributions.

Assuming our framework and in view of the discussion above, in the presence of cycles in the skeleton, the (conservative) PC-algorithm is not able to consistently estimate the true underlying Markov equivalence class when $p$ is large relative to $n$, even for large causal parameters (large edge weights). Some special assumptions on the sparsity and causal parameters might help, but without making such assumptions, the limitation is in the range where $p = p_n = o(\sqrt{n/\log(n)})$. This constitutes a severe limitation of the PC-algorithm. As an alternative method, the penalized maximum likelihood estimator (cf. [3]) does not require strong-faithfulness but instead a stronger version of a beta-min condition (i.e., sufficiently large causal parameters) [15]. This “permutation beta-min” condition has been shown to hold for AR(1) models in [15], page 8. However, a thorough analysis of the “permutation beta-min” condition and a comparison to the strong-faithfulness condition more generally is quite challenging and remains an interesting open problem.

Throughout the paper, we have assumed that the causal parameters are uniformly distributed in the hypercube $[-1, 1]^{|E|}$. Since all hypersurfaces corresponding to unfaithful distributions go through the origin, a prior distribution which puts more mass around the origin (e.g., a Gaussian distribution) would lead to a higher proportion of strong-unfaithful distributions, whereas a prior distribution which puts more mass on the boundary of the hypercube $[-1, 1]^{|E|}$ would reduce the proportion of strong-unfaithful distributions. Computing and comparing these measures for different priors would be an interesting extension of our work. Another interesting problem would be to extend our results to the case of general error variances [i.e., $\operatorname{var}(\varepsilon_j) = \sigma_j^2$]. Finally, very recently the $k$-triangle-faithfulness assumption has been proposed [13] as a sufficient condition for uniform consistency for inferring certain features of the causal structure. This assumption is less restrictive than strong-faithfulness, at the cost of decreasing identifiability, returning a statement “undecidable” for some cases. Analyzing how restrictive the $k$-triangle-faithfulness assumption is and what it means for the high-dimensional setting represents an interesting future direction.

8. Proofs.

Proof of Proposition 4.1. Statement (i) follows from the matrix inversion formula using the cofactor matrix, that is,

$$K^{-1} = \frac{1}{\det(K)}\,C(K),$$

and the fact that the concentration matrix $K$ is positive definite and therefore $\det(K) > 0$. Statement (ii) is a well-known fact about the multivariate Gaussian distribution.

Let $A, B \subset V$ be two subsets of vertices. We denote by $K_{AB}$ the submatrix of $K$ consisting of the entries $K_{ij}$, where $(i, j) \in A \times B$. Let $K_A$ denote the concentration matrix in the Gaussian model, where we marginalized over $A^c = V \setminus A$. With these definitions, we have that

$$K_A = (\Sigma_{AA})^{-1}.$$

The correlation between $X_i$ and $X_j$ conditioned on $S$ corresponds to the $(i, j)$th entry in the matrix $K_Q$. Using the Schur complement formula, we get that

(12) $\quad K_Q = K_{QQ} - K_{QQ^c}\,(K_{Q^cQ^c})^{-1}\,K_{Q^cQ}.$

Since $K_{Q^cQ^c}$ is positive definite, we can rewrite equation (12) as

$$\det(K_{Q^cQ^c})\,K_Q = \det(K_{Q^cQ^c})\,K_{QQ} - K_{QQ^c}\,C(K_{Q^cQ^c})\,K_{Q^cQ},$$

from which statement (iii) follows. □


Proof of (10). We first note that the $(i, j)$th element of $A^s$ consists of the sum of the weights of all paths $p = (p_0, p_1, \ldots, p_s)$ with $p_0 = i$ and $p_s = j$ for which $(p_{k-1}, p_k) \in E$ for all $k = 1, \ldots, s$. This means that $(A^s)_{ij}$ corresponds to all “forward” paths from $i$ to $j$ of length $s$. Analogously, $(A^T)^r$ corresponds to all “backward” paths from $i$ to $j$ of length $r$.

We decompose the covariance matrix using the Neumann power series. We can do this since all eigenvalues of the matrix $A$ are zero (because $A$ is strictly upper triangular):

$$\Sigma = \bigl((I - A)(I - A)^T\bigr)^{-1} = \sum_{k=0}^{\infty}\ \sum_{r+s=k} (A^T)^r A^s = \sum_{k=0}^{2p-2}\ \sum_{\substack{r+s=k,\\ r,s \le p-1}} (A^T)^r A^s.$$

For the last equality, we used the assumption that the underlying graph is acyclic. Using the path interpretation it is clear that for acyclic graphs the matrix $A^s$ is the zero matrix for all $s \ge p$. □
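The two facts used in this proof, that powers of $A$ collect path weights and that the Neumann series terminates, can be checked on a small example. The sketch below uses a hypothetical 4-node DAG with arbitrarily chosen weights; the helper names are illustrative.

```python
def matmul(X, Y):
    """Plain-Python matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def identity(p):
    return [[float(i == j) for j in range(p)] for i in range(p)]


def matrix_power(A, s):
    """A^s; entry (i, j) sums the weights of all directed paths of length s from i to j."""
    result = identity(len(A))
    for _ in range(s):
        result = matmul(result, A)
    return result


def neumann_inverse(A):
    """(I - A)^{-1} as the finite sum I + A + ... + A^(p-1) for nilpotent A."""
    p = len(A)
    total = identity(p)
    power = identity(p)
    for _ in range(1, p):
        power = matmul(power, A)
        total = [[t + q for t, q in zip(tr, pr)] for tr, pr in zip(total, power)]
    return total


# Hypothetical weights: edges 0->1, 0->2, 1->2, 1->3, 2->3.
A = [[0.0, 0.5, -0.3, 0.0],
     [0.0, 0.0, 0.8, 0.2],
     [0.0, 0.0, 0.0, -0.6],
     [0.0, 0.0, 0.0, 0.0]]

I_minus_A = [[float(i == j) - A[i][j] for j in range(4)] for i in range(4)]
print(matrix_power(A, 4))                      # zero matrix: no path has 4 edges
print(matmul(I_minus_A, neumann_inverse(A)))   # identity up to rounding
```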


Proof  of  Corollary  4.3.  To  prove  (a),  we  first  consider  the  special  case 
where  G  is  a  directed  line  on  p  nodes,  where  all  edges  point  in  the  same  direction, 
that  is,  (/,  /  +  1)  G  F  for  1  <  /  <  p.  The  following  argument  can  then  easily  he 
generalized  to  directed  trees  Tp. 

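Let $i, j \in V$ and without loss of generality we assume that $i < j$. Since there are no colliders in $G$, it follows from (9) that

$$K_{ij} = \begin{cases} -a_{ij}, & \text{if $j$ is a child of $i$,}\\ 0, & \text{otherwise.} \end{cases}$$

Moreover, $\Sigma_{ij}$ corresponds to all collider-free paths from $i$ to $j$ and therefore

$$\Sigma_{ij} = \bigl(1 + a_{i-1,i}^2\bigl(1 + \cdots \bigl(1 + a_{12}^2\bigr)\bigr)\bigr) \prod_{k=i}^{j-1} a_{k,k+1}. \tag{13}$$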


The  first  term  corresponds  to  the  value  of  all  collider-free  loops  from  i  to  i  and  the 
second  term  to  the  value  of  the  path  from  i  to  j. 
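As a check on (13), the closed form can be compared with the covariance matrix computed from the Neumann series for a directed line. This is a sketch under assumptions: the weights on the line $1 \to 2 \to \cdots \to 5$ are made up, `closed_form` is a hypothetical helper, and (13) is read as a nested variance factor times the product of the edge weights $a_{k,k+1}$ for $k = i, \dots, j-1$.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

p = 5
# a[k] is the weight a_{k,k+1} of the edge k -> k+1 (1-based labels); assumed values.
a = {1: F(1, 2), 2: F(-2, 3), 3: F(3), 4: F(1, 4)}

def closed_form(i, j):
    """Closed form (13) for Sigma_ij on the directed line, with i <= j."""
    loops = F(1)
    for k in range(1, i):            # builds 1 + a_{i-1,i}^2 (1 + ... (1 + a_{12}^2))
        loops = 1 + a[k] ** 2 * loops
    path = F(1)
    for k in range(i, j):            # product of a_{k,k+1} along the path i -> j
        path *= a[k]
    return loops * path

# Covariance from the Neumann series: Sigma = M^T M with M = sum_{s<p} A^s = (I - A)^{-1}.
A = [[F(0)] * p for _ in range(p)]
for k, w in a.items():
    A[k - 1][k] = w
power = [[F(int(r == c)) for c in range(p)] for r in range(p)]
M = [[F(0)] * p for _ in range(p)]
for _ in range(p):
    M = [[x + y for x, y in zip(rm, rp)] for rm, rp in zip(M, power)]
    power = matmul(power, A)
Sigma = matmul(transpose(M), M)

matches = all(Sigma[i - 1][j - 1] == closed_form(i, j)
              for i in range(1, p + 1) for j in range(i, p + 1))
```

For instance, with these weights `closed_form(2, 3)` is $(1 + a_{12}^2)\,a_{23} = \tfrac{5}{4}\cdot\bigl(-\tfrac{2}{3}\bigr) = -\tfrac{5}{6}$.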

Let $S \subset V \setminus \{i, j\}$ and $Q = S \cup \{i, j\}$. If there exists an element $s \in S$ such that $i < s < j$, then the CI relation $X_i \perp\!\!\!\perp X_j \mid X_S$ is already entailed by the Markov condition. We can therefore assume without loss of generality that there is no $s \in S$ such that $i < s < j$. Since there are no colliders in $G$, it follows from Proposition 4.1(iii) that the corresponding polynomial is of the form

$$\begin{cases} -\det(K_{Q^cQ^c})\,a_{ij}, & \text{if $j$ is a child of $i$,}\\[4pt] -\displaystyle\sum_{p,q\in Q^c} a_{ip}\,C(K_{Q^cQ^c})_{pq}\,a_{qj}, & \text{otherwise.} \end{cases} \tag{14}$$


The corresponding symmetrized and reweighted graph $G$ for $p = 5$ is shown in Figure 12(a). Note that there is a unique self-avoiding path between any two vertices. As a consequence, the polynomial corresponding to the CI relation $X_i \perp\!\!\!\perp X_j \mid X_S$ in (14) can be written as


$$-\det(K_{PP}) \prod_{k=i}^{j-1} a_{k,k+1}, \tag{15}$$

where $P = Q^c \setminus \{i+1, \dots, j-1\}$.

We now analyze the cycles in $P$. We decompose $P$ into intervals $P = P_1 \cup \cdots \cup P_s$, where $P_t = \{p_t^-, p_t^- + 1, \dots, p_t^+\}$. We need to distinguish two cases. If $p_t^+ = p$, then the subgraph $G_{P_t}$ is of the form as shown in Figure 12(a) (for $p_t^- = 1$ and $p_t^+ = 5$). Otherwise the subgraph is of the form as shown in Figure 12(b) (for $p_t^- = 1$ and $p_t^+ = 5$).

We note that all cycles are either of length 1 (with value $-a_{k,k+1}^2$) or of length 2 (with value $a_{k,k+1}^2$). In the case where $p_t^+ = p$, all cycles of length 1 cancel with the cycles of length 2. In the case where $p_t^+ < p$, however, the cycle of length 1 with value $-a_{p_t^+,\,p_t^+ + 1}^2$ does not cancel and therefore neither does the combination of $k$ cycles

$$\prod_{i=0}^{k-1} \bigl(-a_{p_t^+ - i,\; p_t^+ - i + 1}^2\bigr)$$

for any $k \in \{1, \dots, p_t^+ - p_t^-\}$. As a consequence, the polynomial corresponding to the CI relation $X_i \perp\!\!\!\perp X_j \mid X_S$ in (15) can be written as


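$$-\prod_{t=1}^{s} \Bigl(1 + a_{p_t^+ - 1,\,p_t^+}^2 \bigl(1 + a_{p_t^+ - 2,\,p_t^+ - 1}^2 \bigl(\cdots \bigl(1 + a_{p_t^-,\,p_t^- + 1}^2\bigr)\bigr)\bigr)\Bigr) \prod_{k=i}^{j-1} a_{k,k+1}.$$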

The  proofs  for  (b)  and  (c)  are  analogous  and  basically  require  understanding  the 
cycles  in  G.  □ 